An additional dataset is necessary because not all articles have information about their authors. NYT_clean_author will therefore contain fewer articles than NYT_clean, because articles with missing author information are dropped from the dataset.
Dataset with authors
First things first: we start by loading the packages tidyverse (Wickham et al. 2019), DT (Xie, Cheng, and Tan 2021) and gender (Mullen 2020).
pacman::p_load(tidyverse, gender, DT)
We are going to make the author dataset NYT_clean_author from the dataset NYT_raw. In the code the data frame is called NYT_aut. The code mostly follows the syntax of the tidyverse package.
#selecting columns to keep
coloumns_to_select <- c("response.docs.headline.main", "response.docs.web_url",
"response.docs.pub_date", "firstname", "lastname", "rank")
#creating the new dataframe
NYT_aut <- NYT_raw %>%
# Information about the author's name is stored inside a data frame in a column of NYT_raw. We use the function "unnest" to unpack that data frame into separate columns
unnest(cols = response.docs.byline.person) %>%
#selecting the defined columns
select(all_of(coloumns_to_select)) %>%
#renaming columns to more human-readable names
rename(
"headline" = "response.docs.headline.main",
"url" = "response.docs.web_url",
"date" = "response.docs.pub_date"
) %>%
#making a new column with the full name
mutate(
full_name = str_c(firstname, lastname, sep = "_")
) %>%
#formatting columns to the correct class
mutate(
date = as.Date(date),
firstname = as.factor(firstname),
lastname = as.factor(lastname),
rank = as.factor(rank),
full_name = as.factor(full_name)
) %>%
#removing rows where the author is missing
filter(!is.na(full_name)) %>%
#arranging by date so that the articles are in chronological order
arrange(date)
#saving to a csv
write_csv(NYT_aut, "data/new_york_times/data_additional/NYT_clean_author_cp18.csv")
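To make the unnest and missing-author steps more concrete, here is a minimal sketch on a made-up tibble. The object toy_raw and all its values are invented for illustration; only the column names mirror NYT_raw. It assumes that a missing author shows up as NA in firstname and lastname.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hypothetical toy data: each article carries a nested data frame of authors.
toy_raw <- tibble(
  response.docs.headline.main = c("Article A", "Article B", "Article C"),
  response.docs.byline.person = list(
    tibble(firstname = "Jane", lastname = "Doe", rank = 1L),
    tibble(firstname = c("John", "Ann"), lastname = c("Smith", "Lee"), rank = 1:2),
    tibble(firstname = NA_character_, lastname = NA_character_, rank = NA_integer_)
  )
)

toy_aut <- toy_raw %>%
  # one row per author: an article with two authors becomes two rows
  unnest(cols = response.docs.byline.person) %>%
  # str_c returns NA as soon as one of its inputs is NA
  mutate(full_name = str_c(firstname, lastname, sep = "_")) %>%
  # drop rows without an author
  filter(!is.na(full_name))

nrow(toy_raw) # number of articles before cleaning
nrow(toy_aut) # number of author rows after cleaning
n_distinct(toy_aut$response.docs.headline.main) # articles that still have an author
```

Note that the author dataset has one row per author, not per article, so an article with several authors appears several times. Also, unnest by default drops rows whose nested data frame is empty (keep_empty = FALSE), so such articles would disappear even before the filter step.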
Now let's inspect the cleaned dataframe to see if it looks all right. Again, we use the function datatable from the package DT. The cleaned dataset contains the following columns:
headline which is the title/headline of the article
url which is a URL leading to the article on the NYT webpage
date which is the publication date
firstname which is the first name of the author of the article
lastname which is the last name of the author of the article
rank which indicates the rank of the author
full_name which is the full name of the author
#reading the csv
NYT_aut <- read_csv("data/author_dataset/NYT_author.csv")
#making a nice dataframe that we can browse
font.size <- "8pt"
DT::datatable(
NYT_aut,
rownames = FALSE,
filter = "top",
options = list(
initComplete = htmlwidgets::JS(
"function(settings, json) {",
paste0("$(this.api().table().container()).css({'font-size': '", font.size, "'});"),
"}"),
pageLength = 3,
scrollX = TRUE,
autoWidth = TRUE
)
)
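One thing to keep in mind when reading the CSV back in: a CSV file does not store column classes, so the factors created during cleaning come back as plain character columns. The following is a minimal sketch of how the classes could be restored with the col_types argument of read_csv. The toy tibble and the temporary file are made up for illustration and are not part of the NYT data.

```r
library(readr)
library(tibble)

# Hypothetical toy data with a date and a factor column
toy <- tibble(
  headline = c("Article A", "Article B"),
  date = as.Date(c("2020-01-01", "2020-01-02")),
  full_name = factor(c("Jane_Doe", "John_Smith"))
)

# writing to a temporary file so that nothing in the project is overwritten
tmp <- tempfile(fileext = ".csv")
write_csv(toy, tmp)

# letting readr guess the column types: full_name becomes character
plain <- read_csv(tmp, col_types = cols())

# stating the column types explicitly: full_name is a factor again
typed <- read_csv(tmp, col_types = cols(
  headline = col_character(),
  date = col_date(),
  full_name = col_factor()
))

class(plain$full_name)
class(typed$full_name)
```

Whether this matters depends on the later analysis. For browsing the table with datatable, character columns work fine, but the column filters behave differently for factors and for free text.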